HttpClient 4.5.x Tutorial

Introduction

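I had been using HttpClient for crawler requests for a while without really understanding how it works underneath, especially since the HttpClient object is so large.
Once I had some grasp of how HTTP works at a low level, I became curious how the Java HttpClient wrapper supports the protocol. I spent a day studying it and came away with a few findings.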

How to build an HttpClient that can send requests

This section is based mainly on how the WebMagic source code builds its client. Three main objects are involved: HttpClient, HttpUriRequest, and HttpClientContext.

The HttpClient object

private CloseableHttpClient generateClient(Site site) {
    HttpClientBuilder httpClientBuilder = HttpClients.custom();
    // Share the pooled connection manager created in the constructor
    httpClientBuilder.setConnectionManager(this.connectionManager);
    if (site.getUserAgent() != null) {
        httpClientBuilder.setUserAgent(site.getUserAgent());
    } else {
        httpClientBuilder.setUserAgent("");
    }
    if (site.isUseGzip()) {
        // Ask for gzip unless the request already carries an Accept-Encoding header
        httpClientBuilder.addInterceptorFirst(new HttpRequestInterceptor() {
            public void process(HttpRequest request, HttpContext context) throws HttpException, IOException {
                if (!request.containsHeader("Accept-Encoding")) {
                    request.addHeader("Accept-Encoding", "gzip");
                }
            }
        });
    }
    httpClientBuilder.setRedirectStrategy(new CustomRedirectStrategy());
    // Socket-level options: TCP keep-alive, no Nagle delay, read timeout
    SocketConfig.Builder socketConfigBuilder = SocketConfig.custom();
    socketConfigBuilder.setSoKeepAlive(true).setTcpNoDelay(true);
    socketConfigBuilder.setSoTimeout(site.getTimeOut());
    SocketConfig socketConfig = socketConfigBuilder.build();
    httpClientBuilder.setDefaultSocketConfig(socketConfig);
    this.connectionManager.setDefaultSocketConfig(socketConfig);
    // Retry up to site.getRetryTimes() times, including requests that were already sent
    httpClientBuilder.setRetryHandler(new DefaultHttpRequestRetryHandler(site.getRetryTimes(), true));
    this.generateCookie(httpClientBuilder, site);
    return httpClientBuilder.build();
}

Analysis: the code above sets the User-Agent, some basic socket options, a DefaultHttpRequestRetryHandler, an HttpRequestInterceptor, and the ConnectionManager.
Of these, I looked into the ConnectionManager setup in more depth.
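
To make the interceptor step concrete, here is a minimal sketch that runs the same "add Accept-Encoding only if absent" logic against a bare request, without any network access. The class name GzipInterceptorSketch, the helper acceptEncodingAfter, the "demo-agent" User-Agent, and the retry count of 3 are assumptions made for illustration; they are not part of WebMagic.

```java
import java.io.IOException;

import org.apache.http.HttpException;
import org.apache.http.HttpRequest;
import org.apache.http.HttpRequestInterceptor;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.DefaultHttpRequestRetryHandler;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.message.BasicHttpRequest;
import org.apache.http.protocol.BasicHttpContext;

final class GzipInterceptorSketch {

    // Adds "Accept-Encoding: gzip" only when the caller has not set the header already.
    static final HttpRequestInterceptor GZIP = (request, context) -> {
        if (!request.containsHeader("Accept-Encoding")) {
            request.addHeader("Accept-Encoding", "gzip");
        }
    };

    // Hypothetical helper: builds a bare GET request, optionally presets the header,
    // runs the interceptor, and returns the header value the request ends up with.
    static String acceptEncodingAfter(String presetValue) {
        HttpRequest request = new BasicHttpRequest("GET", "/");
        if (presetValue != null) {
            request.addHeader("Accept-Encoding", presetValue);
        }
        try {
            GZIP.process(request, new BasicHttpContext());
        } catch (HttpException | IOException e) {
            throw new IllegalStateException(e);
        }
        return request.getFirstHeader("Accept-Encoding").getValue();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(acceptEncodingAfter(null)); // prints "gzip"

        // The same interceptor plugged into a client, next to the other settings discussed above.
        CloseableHttpClient client = HttpClients.custom()
                .setUserAgent("demo-agent") // assumed value
                .addInterceptorFirst(GZIP)
                .setRetryHandler(new DefaultHttpRequestRetryHandler(3, true)) // assumed retry count
                .build();
        client.close();
    }
}
```

Keeping the interceptor in a typed field avoids the overload ambiguity between the request and response variants of addInterceptorFirst.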

ConnectionManager

private PoolingHttpClientConnectionManager connectionManager;

public HttpClientGenerator() {
    // Map each URI scheme to the socket factory that opens its connections
    Registry<ConnectionSocketFactory> reg = RegistryBuilder.<ConnectionSocketFactory>create()
            .register("http", PlainConnectionSocketFactory.INSTANCE)
            .register("https", this.buildSSLConnectionSocketFactory())
            .build();
    this.connectionManager = new PoolingHttpClientConnectionManager(reg);
    this.connectionManager.setDefaultMaxPerRoute(100);
    this.connectionManager.setMaxTotal(200);
}

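DefaultMaxPerRoute is the maximum number of concurrent connections the pool allows per route, here 100. A route is roughly one target host (host plus port, together with any proxy in between).

MaxTotal is the maximum number of connections in the whole pool across all routes, here 200. Each open connection uses one local port, so at most 200 local ports are in use for TCP connections at any time.

The difference between PoolingHttpClientConnectionManager and BasicHttpClientConnectionManager:
the former keeps a pool of connections and can serve many threads concurrently; the latter holds a single connection and is meant to be used by one thread at a time.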
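
The two limits can be read back from the manager, which makes their meaning easier to check. The sketch below assumes a default registry, a hypothetical host example.com, and a made-up per-route override of 20; PoolLimitsSketch and describeLimits are names invented for this example.

```java
import org.apache.http.HttpHost;
import org.apache.http.conn.routing.HttpRoute;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

final class PoolLimitsSketch {

    // Configures a pool like the one above and reports the limits it ends up with.
    static String describeLimits() {
        PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
        cm.setMaxTotal(200);           // cap for the whole pool, all routes together
        cm.setDefaultMaxPerRoute(100); // cap for any single route without its own setting

        // Assumed example: give one specific route a tighter cap than the default.
        HttpRoute route = new HttpRoute(new HttpHost("example.com", 80));
        cm.setMaxPerRoute(route, 20);

        String summary = "total=" + cm.getMaxTotal()
                + " defaultPerRoute=" + cm.getDefaultMaxPerRoute()
                + " example.com=" + cm.getMaxPerRoute(route)
                + " leased=" + cm.getTotalStats().getLeased(); // no request sent yet
        cm.shutdown();
        return summary;
    }

    public static void main(String[] args) {
        System.out.println(describeLimits());
    }
}
```

With 200 total and 100 per route, two busy hosts can already exhaust the pool; further requests then wait for a free connection, bounded by the connection request timeout.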

HttpUriRequest

private HttpUriRequest convertHttpUriRequest(Request request, Site site, Proxy proxy) {
    RequestBuilder requestBuilder = this.selectRequestMethod(request).setUri(request.getUrl());
    // Site-wide headers apply to every request for this site
    if (site != null && site.getHeaders() != null) {
        for (Entry<String, String> headerEntry : site.getHeaders().entrySet()) {
            requestBuilder.addHeader(headerEntry.getKey(), headerEntry.getValue());
        }
    }
    RequestConfig.Builder requestConfigBuilder = RequestConfig.custom();
    if (site != null) {
        // One timeout value covers pool lease, connect, and socket read
        requestConfigBuilder.setConnectionRequestTimeout(site.getTimeOut())
                .setSocketTimeout(site.getTimeOut())
                .setConnectTimeout(site.getTimeOut())
                .setCookieSpec("standard");
    }
    if (proxy != null) {
        requestConfigBuilder.setProxy(new HttpHost(proxy.getHost(), proxy.getPort()));
    }
    requestBuilder.setConfig(requestConfigBuilder.build());
    HttpUriRequest httpUriRequest = requestBuilder.build();
    // Per-request headers are added after the site-wide ones
    if (request.getHeaders() != null && !request.getHeaders().isEmpty()) {
        for (Entry<String, String> header : request.getHeaders().entrySet()) {
            httpUriRequest.addHeader(header.getKey(), header.getValue());
        }
    }
    return httpUriRequest;
}

This mainly sets the proxy and headers for the request, along with its timeouts and cookie spec.
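
Stripped of the WebMagic types, the same idea fits in a few lines. This is only a sketch: the URL, the proxy address 127.0.0.1:8888, the Referer value, and the 5000 ms timeouts are placeholders, and RequestSketch, build, and summary are names made up for this example.

```java
import org.apache.http.HttpHost;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.Configurable;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;

final class RequestSketch {

    // Builds a GET request that carries its own proxy, timeouts, and one header.
    static HttpUriRequest build(String url, String proxyHost, int proxyPort) {
        RequestConfig config = RequestConfig.custom()
                .setConnectionRequestTimeout(5000) // wait for a pooled connection
                .setConnectTimeout(5000)           // wait for the TCP handshake
                .setSocketTimeout(5000)            // wait between received packets
                .setProxy(new HttpHost(proxyHost, proxyPort))
                .build();
        return RequestBuilder.get()
                .setUri(url)
                .addHeader("Referer", "http://example.com/")
                .setConfig(config)
                .build();
    }

    // Reads the settings back from the built request, without sending it.
    static String summary() {
        HttpUriRequest request = build("http://example.com/page", "127.0.0.1", 8888);
        RequestConfig config = ((Configurable) request).getConfig();
        return request.getMethod() + " " + request.getURI()
                + " via " + config.getProxy().toHostString()
                + " referer=" + request.getFirstHeader("Referer").getValue();
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Because the RequestConfig travels with the request, different requests sent through one shared client can use different proxies.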

HttpClientContext

It is mainly used to supply the cookies this request needs.

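If you use a pool of HttpClient instances and want a single login session to be shared by several of them, you have to keep the session state yourself. On the client side the session is identified by a cookie (for example JSESSIONID), so it is enough to copy the cookies returned by a successful login to every HttpClient.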

There are two ways to work with cookies: hold them in a CookieStore yourself (see the testCookieStore() method), or carry them in an HttpClientContext (see the testContext() method).
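
Both test methods rely on fields that the listings do not show. A minimal sketch of how a shared cookie store and a context might be prepared is given here; the cookie value, the domain example.com, and all names in SharedSessionSketch are assumptions for illustration only.

```java
import org.apache.http.client.CookieStore;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.impl.client.BasicCookieStore;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.cookie.BasicClientCookie;

final class SharedSessionSketch {

    // Builds a store holding one session cookie, as if copied from a login response.
    static CookieStore sessionStore(String sessionId) {
        BasicCookieStore store = new BasicCookieStore();
        BasicClientCookie cookie = new BasicClientCookie("JSESSIONID", sessionId);
        cookie.setDomain("example.com"); // assumed domain
        cookie.setPath("/");
        store.addCookie(cookie);
        return store;
    }

    // Way 1: the client itself uses the store for every request it executes.
    static CloseableHttpClient clientSharing(CookieStore store) {
        return HttpClients.custom().setDefaultCookieStore(store).build();
    }

    // Way 2: the store rides along in a context passed to execute(request, context).
    static HttpClientContext contextSharing(CookieStore store) {
        HttpClientContext context = HttpClientContext.create();
        context.setCookieStore(store);
        return context;
    }

    // Returns the session id visible through a context built from a fresh store.
    static String sessionSeenByContext(String sessionId) {
        HttpClientContext context = contextSharing(sessionStore(sessionId));
        return context.getCookieStore().getCookies().get(0).getValue();
    }

    public static void main(String[] args) throws Exception {
        CookieStore store = sessionStore("abc123");
        // Two clients backed by the same store see the same session cookie.
        try (CloseableHttpClient first = clientSharing(store);
             CloseableHttpClient second = clientSharing(store)) {
            System.out.println(store.getCookies());
        }
    }
}
```

The store is shared by reference, so a cookie set by a response on one client becomes visible to every other client or context that holds the same store.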

Using a CookieStore:

// cookieStore and testUrl are fields of the test class (not shown here)
@Test
public void testCookieStore() throws Exception {
    System.out.println("----testCookieStore");
    // Use the CookieStore approach
    CloseableHttpClient client = HttpClients.custom()
            .setDefaultCookieStore(cookieStore).build();
    HttpGet httpGet = new HttpGet(testUrl);
    System.out.println("request line:" + httpGet.getRequestLine());
    try {
        // Execute the GET request
        HttpResponse httpResponse = client.execute(httpGet);
        System.out.println("cookie store:" + cookieStore.getCookies());
        printResponse(httpResponse);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            // Close the client and release its resources
            client.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

public static void printResponse(HttpResponse httpResponse)
        throws ParseException, IOException {
    // Get the response entity
    HttpEntity entity = httpResponse.getEntity();
    // Response status
    System.out.println("status:" + httpResponse.getStatusLine());
    System.out.println("headers:");
    HeaderIterator iterator = httpResponse.headerIterator();
    while (iterator.hasNext()) {
        System.out.println("\t" + iterator.next());
    }
    // Check whether the response has an entity
    if (entity != null) {
        String responseString = EntityUtils.toString(entity);
        System.out.println("response length:" + responseString.length());
        System.out.println("response content:"
                + responseString.replace("\r\n", ""));
    }
}

Using a context:

// context and testUrl are fields of the test class (not shown here)
@Test
public void testContext() throws Exception {
    System.out.println("----testContext");
    // Use the context approach
    CloseableHttpClient client = HttpClients.createDefault();
    HttpGet httpGet = new HttpGet(testUrl);
    System.out.println("request line:" + httpGet.getRequestLine());
    try {
        // Execute the GET request
        HttpResponse httpResponse = client.execute(httpGet, context);
        System.out.println("context cookies:"
                + context.getCookieStore().getCookies());
        printResponse(httpResponse);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        try {
            // Close the client and release its resources
            client.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

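TCP connections

A TCP connection is end-to-end: it runs from a port on one host to a port on another.

Capturing traffic with Wireshark shows that an HTTP request always travels over a TCP connection, which has to be established first, and that each connection has its own source port on the client.

Port reuse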

[Full request URI: http://www.cnblogs.com/cate/java/] :Source Port: 55212 Destination Port: 80

[Full request URI: http://www.cnblogs.com/bundles/aggsite.css?v=IhfFDNk6saBQuSizNqMno4eFb5L3OoXlsUCqkaSgNvA1] :Source Port: 55212 Destination Port: 80

[Full request URI: http://www.cnblogs.com/bundles/aggsite.js?v=vWqa5z-vvnUBiauXGl6S0-ZbtOAq_fbE-A1hKZngtlw1] :Source Port: 55213 Destination Port: 80

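When Chrome loads http://www.cnblogs.com/cate/java/, source port 55212 carries two HTTP requests, one for the HTML and one for the CSS. In other words, one TCP connection is reused for both (HTTP keep-alive), while the JS request goes out over a second connection from port 55213.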
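
The same effect can be reproduced locally without Wireshark. The sketch below uses only the JDK: it starts a throwaway HTTP server on a random local port, sends several requests over one socket, and records the client port the server sees for each. It is an illustration under those assumptions, not what Chrome or HttpClient does internally; KeepAliveSketch and its methods are names invented here.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

final class KeepAliveSketch {

    // Sends `count` GET requests over ONE socket; returns the client port seen per request.
    static List<Integer> clientPortsSeenByServer(int count) {
        List<Integer> ports = new CopyOnWriteArrayList<>();
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/", exchange -> {
                ports.add(exchange.getRemoteAddress().getPort());
                byte[] body = "ok".getBytes(StandardCharsets.US_ASCII);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            try (Socket socket = new Socket("127.0.0.1", server.getAddress().getPort())) {
                OutputStream out = socket.getOutputStream();
                InputStream in = socket.getInputStream();
                for (int i = 0; i < count; i++) {
                    String request = "GET /page" + i + " HTTP/1.1\r\nHost: localhost\r\n\r\n";
                    out.write(request.getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                    readResponse(in); // finish this response before sending the next request
                }
            } finally {
                server.stop(0);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ports;
    }

    // Reads one response: header lines up to the blank line, then Content-Length bytes.
    private static void readResponse(InputStream in) throws IOException {
        int contentLength = 0;
        String line;
        while (!(line = readLine(in)).isEmpty()) {
            if (line.toLowerCase().startsWith("content-length:")) {
                contentLength = Integer.parseInt(line.substring("content-length:".length()).trim());
            }
        }
        for (int i = 0; i < contentLength; i++) {
            if (in.read() < 0) {
                throw new EOFException("connection closed mid-response");
            }
        }
    }

    // Reads bytes up to the next LF and drops the CR.
    private static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(clientPortsSeenByServer(3));
    }
}
```

Every entry in the returned list is the same port, because all requests share one TCP connection. Opening a new Socket per request would give a different source port each time.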

References

HttpClient 4.5.x official documentation

WebMagic in Action

HttpClient 4.x: keeping a session with cookies
